predup youtube playlists and tabs, and more... #374
Draft
galgeek wants to merge 85 commits into internetarchive:master from
Pull Request Overview
This PR adds functionality for processing YouTube playlists and tabs.
- Introduces a new database query in brozzler/ydl.py to capture video URLs via psycopg.
- Updates model and schema files to propagate an account_id field.
- Adds a psycopg dependency in pyproject.toml to support the new database operations.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds psycopg dependency for database connectivity |
| brozzler/ydl.py | Adds get_video_captures function and updates YouTube URL handling |
| brozzler/model.py | Propagates account_id to Site objects in job creation |
| brozzler/job_schema.yaml | Updates schema to include an optional account_id field |
Comments suppressed due to low confidence (1)
brozzler/ydl.py:428
- Using string concatenation with '+' in the SQL LIKE clause is not standard for PostgreSQL. Consider using the concatenation operator '||' (e.g., "... containing_page_url like '%' || %s || '%'") or the CONCAT function.
pg_query = ("SELECT containing_page_url from video where account_id = %s and seed = %s and containing_page_url like '%'+%s+'%'", (account_id, seed, source,))
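A minimal sketch of one way to act on the suppressed comment above, offered as an illustration rather than as the PR's code. Instead of concatenating wildcards in SQL, it builds the `LIKE` pattern in Python and passes it as an ordinary bound parameter. This also sidesteps the fact that psycopg treats `%` in a parameterized query string as a placeholder marker, so a literal `%` written into the SQL text would have to be doubled as `%%`. The table and column names come from the query above; the function name `build_video_query`, the prefix-match behavior, and the escaping of `%` and `_` in the prefix are assumptions.

```python
def build_video_query(seed, url_prefix, account_id=None):
    """Return (sql, params) suitable for cursor.execute(sql, params).

    Hypothetical helper: only the table and column names are taken
    from the PR; everything else is illustrative.
    """
    # Escape LIKE metacharacters so the prefix matches literally
    # (backslash is PostgreSQL's default LIKE escape character),
    # then append the trailing wildcard in Python, not in SQL.
    escaped = (
        url_prefix.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    pattern = escaped + "%"

    sql = "SELECT DISTINCT containing_page_url FROM video WHERE seed = %s"
    params = [seed]
    if account_id is not None:
        # Narrow by account only when one is known.
        sql += " AND account_id = %s"
        params.append(account_id)
    sql += " AND containing_page_url LIKE %s"
    params.append(pattern)
    return sql, tuple(params)
```

With a psycopg cursor, the two values would be passed as separate arguments, for example `cur.execute(*build_video_query(seed, "http://youtube.com/watch", account_id))`.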
96aeeb7 to b5258f4
avdempsey reviewed Jun 23, 2025
avdempsey reviewed Jun 23, 2025
avdempsey reviewed Jun 24, 2025
Comment on lines +423 to +457
    def get_video_captures(site, source="youtube"):
        if not VIDEO_DATA_SOURCE:
            return None

        if VIDEO_DATA_SOURCE and VIDEO_DATA_SOURCE.startswith("postgresql"):
            import psycopg

            account_id = site.account_id if site.account_id else None
            seed = site.metadata.ait_seed_id if site.metadata.ait_seed_id else None
            if source == "youtube":
                containing_page_url_pattern = "http://youtube.com/watch"  # yes, video data canonicalization uses "http"
            # support other sources here
            else:
                containing_page_url_pattern = None
            if account_id and seed and source:
                pg_query = (
                    "SELECT distinct(containing_page_url) from video where account_id = %s and seed = %s and containing_page_url like %s",
                    (
                        account_id,
                        seed,
                        containing_page_url_pattern,
                    ),
                )
            elif seed and source:
                pg_query = (
                    "SELECT containing_page_url from video where seed = %s and containing_page_url like %s",
                    (seed, containing_page_url_pattern),
                )
            else:
                return None
            with psycopg.connect(VIDEO_DATA_SOURCE) as conn:
                with conn.cursor(row_factory=psycopg.rows.scalar_row) as cur:
                    cur.execute(pg_query)
                    return cur.fetchall()
        return None
Contributor
We should consider wrapping this call in a client interface:
- The client could then query PG directly or make an API call.
- A client could maintain connections instead of re-connecting to PG on every check.
- We need to add unit tests before we ship this feature; a client might make unit testing easier.
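The shape of such a client might look roughly like the sketch below. It is a hypothetical illustration of the suggestion, not the code that later landed in the PR: the class names, the `URL_PREFIXES` mapping, the trailing `%` wildcard, and the choice to pass `seed` and `account_id` directly instead of a `site` object are all assumptions. The psycopg calls used (`psycopg.connect`, `cursor(row_factory=...)`, `psycopg.rows.scalar_row`) are the ones already present in the diff, imported lazily so the in-memory fake can be tested without a database.

```python
from typing import List, Optional, Protocol

# Assumed mapping of source name to canonicalized URL prefix.
URL_PREFIXES = {"youtube": "http://youtube.com/watch"}


class VideoCaptureClient(Protocol):
    """Interface callers depend on; the backend may be PG or an API."""

    def get_video_captures(
        self, seed, source: str = "youtube", account_id=None
    ) -> Optional[List[str]]: ...


class PostgresVideoCaptureClient:
    """Keeps one connection open and reuses it across checks."""

    def __init__(self, dsn: str):
        self.dsn = dsn
        self._conn = None

    def _connection(self):
        if self._conn is None or self._conn.closed:
            import psycopg  # lazy import: only needed for this backend

            self._conn = psycopg.connect(self.dsn, autocommit=True)
        return self._conn

    def get_video_captures(self, seed, source="youtube", account_id=None):
        prefix = URL_PREFIXES.get(source)
        if not seed or prefix is None:
            return None
        sql = (
            "SELECT DISTINCT containing_page_url FROM video "
            "WHERE seed = %s AND containing_page_url LIKE %s"
        )
        params = [seed, prefix + "%"]  # assumption: prefix match
        if account_id:
            sql += " AND account_id = %s"
            params.append(account_id)
        from psycopg.rows import scalar_row

        with self._connection().cursor(row_factory=scalar_row) as cur:
            cur.execute(sql, params)
            return cur.fetchall()


class FakeVideoCaptureClient:
    """In-memory stand-in for unit tests; rows are (account_id, seed, url)."""

    def __init__(self, rows):
        self.rows = rows

    def get_video_captures(self, seed, source="youtube", account_id=None):
        prefix = URL_PREFIXES.get(source)
        if not seed or prefix is None:
            return None
        return sorted(
            {
                url
                for acct, row_seed, url in self.rows
                if row_seed == seed
                and url.startswith(prefix)
                and (not account_id or acct == account_id)
            }
        )
```

Code that needs capture data would accept any `VideoCaptureClient`, so a test can hand it a `FakeVideoCaptureClient` while production wiring chooses the PG-backed or an API-backed implementation.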
Contributor
Author
recent updates include most of this work
avdempsey reviewed Jun 24, 2025
avdempsey reviewed Jun 24, 2025
46a44cc to 46e17c1
avdempsey reviewed Jul 1, 2025
avdempsey reviewed Jul 1, 2025
added 13 commits July 14, 2025 16:17
added 13 commits July 29, 2025 12:13
fe07f55 to f22fa8f
f22fa8f to 6a8c62f
79fd88d to 8abb9cd
added 12 commits August 15, 2025 14:17